First, I import the required libraries and the previously created markdown from Portfolio 1. The include helper below installs a package only if it is missing, then loads it quietly.
include <- function(library_name){
  if( !(library_name %in% installed.packages()) )
    install.packages(library_name)
  suppressMessages(library(library_name, character.only = TRUE))
}
include("tidyverse")
include("knitr")
include("stringr")
include("caret")
include("rvest")
include("DT")
#Extract the R code from linkedin.Rmd into part1.r so it can be sourced below.
suppressMessages(purl("linkedin.Rmd", output = "part1.r"))
## [1] "part1.r"
source("part1.r")
Here, I will be scraping a website called Seek (https://www.seek.com.au/), which appears to be the biggest job-search site posting jobs in Australia. This scrape will collect position, location, description, and pay. I also wanted to grab the company name, but some companies choose to keep themselves private, so their names sit under a different tag and could not be scraped the same way. I have left company names out due to this difficulty.
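If company names were ever needed, one possible workaround is sketched below. It is only a sketch: the selector is the same one used in the scraping function that follows, and the behaviour for private advertisers is an assumption. It relies on rvest's singular html_element(), which yields one result per card, so cards lacking the node come back as NA instead of silently shortening the vector.
#Sketch: per-card company extraction that tolerates private advertisers.
#html_element() (singular) returns one result per card; cards without the
#node come back as NA, keeping the vector aligned with the other columns.
get_companies <- function(cards) {
  cards %>%
    html_element("a._3AMdmRg") %>%
    html_attr("aria-label")
}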
read_jobs <- function(x)
{
  #x is the base search URL; the page number is appended on each iteration.
  #Create an empty tibble to hold the scraped data.
  scrape <- tibble(position = as.character(),
                   location = as.character(),
                   description = as.character(),
                   pay = as.character()
                   )
  #Loop through pages 1 to 200, since 200 is how many pages Seek provides in its search for all jobs.
  for(i in 1:200)
  {
    url <- paste(x, i, sep = "")
    site <- read_html(url)
    #These are the outermost parent nodes, one per job card.
    data <- site %>%
      html_nodes("article._2m3Is-x")
    #Grab the pay data, using html_attr since it fills with NA when there is no tag.
    pay <- data %>%
      html_nodes("div.xxz8a1h > span:nth-child(3)") %>%
      html_attr("aria-label")
    #Grab the position data.
    position <- data %>%
      html_attr("aria-label")
    #Initial cleaning: strip everything after the dash, leaving only the position.
    position <- gsub("-.*", "", position)
    #Location's grandparent node.
    location_grand <- data %>%
      html_nodes("span._3FrNV7v")
    #Location's parent node.
    location_parent <- location_grand %>%
      html_nodes("strong.lwHBT6d")
    #Grab the location data.
    location <- location_parent %>%
      html_nodes(".Eadjc1o") %>%
      html_text()
    #Initial cleaning: remove the 'location: ' prefix.
    location <- gsub("location: ", "", location)
    #There are 2 nodes under the .Eadjc1o class per card and every odd one is the
    #location, so keep only the odd entries (the TRUE/FALSE pattern recycles).
    location <- location[c(TRUE, FALSE)]
    #Grab the description nodes.
    description <- data %>%
      html_nodes(".bl7UwXp") %>%
      html_text()
    #Company grandparent's node.
    company_grand <- site %>%
      html_nodes("article._2m3Is-x > span:nth-child(5)")
    #Company's parent node.
    company_parent <- company_grand %>%
      html_nodes("span")
    #Grab the company nodes.
    company <- company_parent %>%
      html_nodes("a._3AMdmRg") %>%
      html_attr("aria-label")
    #Create a tibble with each of the attributes for this page.
    table <- tibble(position = position,
                    location = location,
                    description = description,
                    pay = pay
                    )
    #Append this page's tibble to the final tibble.
    scrape <- rbind(scrape, table)
  }
  #Return the final tibble with all of the scraped data.
  return(scrape)
}
scraped <- read_jobs("https://www.seek.com.au/jobs?page=")
Now that we've scraped the data, we need to do some more cleaning.
#Extract the yearly salary figure from the pay string, then strip the commas so it can be made numeric.
scraped$pay <- str_extract(scraped$pay, "\\d+,\\d+")
scraped$pay <- as.numeric(gsub(",","",scraped$pay))
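#As a sanity check (the label formats here are hypothetical), the regex keeps
#the first comma-separated number, i.e. the lower bound of a salary range:
#  str_extract("$80,000 - $90,000 per year", "\\d+,\\d+")  # "80,000"
#  str_extract("Competitive salary", "\\d+,\\d+")          # NA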
#Remove rows where pay is NA.
scraped <- scraped[!is.na(scraped$pay), ]
#Make the location column a factor and list its levels.
scraped$location <- as.factor(scraped$location)
levels(scraped$location)
## [1] "ACT"
## [2] "Adelaide"
## [3] "Asia Pacific"
## [4] "Bendigo, Goldfields & Macedon Ranges"
## [5] "Blue Mountains & Central West"
## [6] "Brisbane"
## [7] "Bundaberg & Wide Bay Burnett"
## [8] "Cairns & Far North"
## [9] "Gold Coast"
## [10] "Gosford & Central Coast"
## [11] "Hobart"
## [12] "Kalgoorlie, Goldfields & Esperance"
## [13] "Katherine & Northern Australia"
## [14] "Lismore & Far North Coast"
## [15] "Melbourne"
## [16] "Newcastle, Maitland & Hunter"
## [17] "Perth"
## [18] "Richmond & Hawkesbury"
## [19] "South West Coast VIC"
## [20] "Southern Highlands & Tablelands"
## [21] "Sunshine Coast"
## [22] "Sydney"
## [23] "Toowoomba & Darling Downs"
## [24] "Wagga Wagga & Riverina"
## [25] "Wollongong, Illawarra & South Coast"
#Since some factor levels have only 1 or 2 observations, we will keep only the locations with at least 3.
table(scraped$location)
##
## ACT Adelaide
## 10 9
## Asia Pacific Bendigo, Goldfields & Macedon Ranges
## 2 1
## Blue Mountains & Central West Brisbane
## 1 39
## Bundaberg & Wide Bay Burnett Cairns & Far North
## 1 2
## Gold Coast Gosford & Central Coast
## 2 1
## Hobart Kalgoorlie, Goldfields & Esperance
## 2 1
## Katherine & Northern Australia Lismore & Far North Coast
## 2 2
## Melbourne Newcastle, Maitland & Hunter
## 143 2
## Perth Richmond & Hawkesbury
## 17 1
## South West Coast VIC Southern Highlands & Tablelands
## 1 1
## Sunshine Coast Sydney
## 2 271
## Toowoomba & Darling Downs Wagga Wagga & Riverina
## 1 2
## Wollongong, Illawarra & South Coast
## 1
scraped <- subset(scraped, location %in% c("ACT", "Adelaide", "Brisbane", "Melbourne", "Perth", "Sydney"))
#Output the table as an interactive HTML widget.
datatable(scraped, options=list(pageLength=5))
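One optional tidy-up, not in the original chunk: dropping the now-empty factor levels so that later summaries do not keep listing the filtered-out regions.
#Optional: drop the factor levels left unused after the subset above.
scraped$location <- droplevels(scraped$location)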
# Analysis
Now we can start investigating. I want to see whether pay can be predicted from where a position is located. The result will probably not be meaningful, since there are not many data points and different kinds of jobs pay different amounts, but we will not know until we try. Let's begin!
#Randomly pick 70% of the data (adding a set.seed() call here would make the split reproducible).
sample_selection <- createDataPartition(scraped$pay, p=0.70, list=FALSE)
#Split the data: 70% goes into our train set and 30% into our test set.
train <- scraped[sample_selection, ]
test <- scraped[-sample_selection, ]
#Linear model of our dependent variable (pay) on the independent variable (location).
train_model <- lm(pay ~ factor(location), data = train)
summary(train_model)
##
## Call:
## lm(formula = pay ~ factor(location), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77862 -25062 -10062 4938 1719938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75000.0 43870.3 1.710 0.0883 .
## factor(location)Adelaide -13226.0 59400.7 -0.223 0.8239
## factor(location)Brisbane 409.5 47760.0 0.009 0.9932
## factor(location)Melbourne -4132.2 44830.4 -0.092 0.9266
## factor(location)Perth -16188.5 50657.1 -0.320 0.7495
## factor(location)Sydney 5062.0 44482.2 0.114 0.9095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98100 on 338 degrees of freedom
## Multiple R-squared: 0.003502, Adjusted R-squared: -0.01124
## F-statistic: 0.2376 on 5 and 338 DF, p-value: 0.9457
predictions <- train_model %>% predict(test)
#Let's plot predicted pay against actual pay in the test set.
ggplot(data = test, aes(x = predictions, y = pay)) +
  geom_point() +
  geom_smooth(method = "lm")
#Compute test-set metrics with caret's R2, RMSE, and MAE helpers.
R2 <- R2(predictions, test$pay)
RMSE <- RMSE(predictions, test$pay)
MAE <- MAE(predictions, test$pay)
Looking at the summary of the train model, we see that none of the independent variables has a significant p-value. Therefore, we cannot predict pay from where a position is located. This agrees with my initial hypothesis that the analysis would not be meaningful given the insufficient amount of data and the variety of jobs paying at different rates.
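Given how few rows remain for some cities, a single 70/30 split can also be noisy. As a quick robustness check, a k-fold cross-validation of the same model is sketched below. This is not part of the original analysis: the seed value is arbitrary, and caret::train is written with its namespace to avoid confusion with the train data frame defined above.
#10-fold cross-validation of the same pay ~ location model via caret.
set.seed(123)   #arbitrary seed, assumed only for reproducibility
cv_model <- caret::train(pay ~ location, data = scraped, method = "lm",
                         trControl = trainControl(method = "cv", number = 10))
cv_model$results   #RMSE, Rsquared, and MAE averaged over the folds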
This dataset fits really nicely with my first dataset because it contains beauty metrics calculated from people's profile pictures. Initially, I did not know how to interpret this dataset since there was no documentation. I eventually reached out to the creator of the dataset, Andrew Truman (https://www.linkedin.com/in/kbot/), to ask him a few questions about it. He explained what all of the metrics meant and what units they were in. He also explained that he used a piece of software called Face++ (https://www.faceplusplus.com/), which uses machine learning to assign numerical values to profile pictures. This dataset has the following attributes.
| Variable | Type | Description |
|---|---|---|
| X | int | index of each profile |
| avg_n_pos_per_tenure | int | Average number of positions per tenure |
| avg_pos_len | int | Average position length |
| avg_prev_tenure_len | int | Average previous tenure length |
| c_name | character | Company name |
| n_pos | int | Number of positions |
| n_prev_tenures | int | Number of previous tenures |
| tenure_len | int | Tenure length |
| age | int | Age estimate |
| beauty | int | Beauty estimate |
| beauty_female | int | Beauty estimate as female |
| beauty_male | int | Beauty estimate as male |
| blur | int | Blur level estimate |
| blur_gaussian | int | Gaussian blur level estimate |
| blur_motion | int | Blur motion estimate |
| emo_anger | int | Anger level estimate |
| emo_disgust | int | Disgust level estimate |
| emo_fear | int | Fear level estimate |
| emo_happiness | int | Happiness level estimate |
| emo_neutral | int | Neutral level estimate |
| emo_sadness | int | Sadness level estimate |
| emo_surprise | int | Surprise level estimate |
| ethnicity | string/categorical | Ethnicity estimate |
| face_quality | int | Face quality estimate |
| gender | int | Gender estimate |
| glass | string/categorical | Dark, None, or Normal |
| head_pitch | int | Head pitch estimate |
| head_roll | int | Head roll estimate |
| head_yaw | int | Head Yaw estimate |
| mouth_close | int | Mouth close estimate |
| mouth_mask | int | Mouth Mask estimate |
| mouth_open | int | Mouth open estimate |
| mouth_other | int | Mouth other estimate |
| skin_acne | int | Skin acne estimate |
| skin_dark_circle | int | Skin dark circle estimate |
| skin_health | int | Skin health estimate |
| skin_stain | int | Skin stain estimate |
| smile | int | Smile estimate |
| african | int | African estimate |
| celtic_english | int | Celtic English estimate |
| east_asian | int | East Asian estimate |
| european | int | European estimate |
| greek | int | Greek estimate |
| hispanic | int | Hispanic estimate |
| jewish | int | Jewish estimate |
| muslim | int | Muslim estimate |
| nationality | string | Nationality estimate |
| nordic | int | Nordic estimate |
| south_asian | int | South Asian estimate |
| n_followers | int | Number of followers on LinkedIn |
With this dataset, I want to explore whether various work-history metrics can predict a person's beauty. I believe there is some correlation between beauty and a person's work history. Beauty will be the dependent variable, while average position length, average previous tenure length, and tenure length will be the independent variables (a sketch of this model appears at the end of this section). But first, we need to import the data and do some initial cleaning.
#Import the dataset.
linkedin2 <- read.csv("linkedin_data.csv")
#Delete columns we don't need.
linkedin2$m_urn <- NULL
linkedin2$img <- NULL
#Rename the column to company_name for clarity.
colnames(linkedin2)[colnames(linkedin2)=="c_name"] <- "company_name"
#Change company_name to a factor.
linkedin2$company_name <- as.factor(linkedin2$company_name)
datatable(linkedin2, options=list(pageLength=10))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
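For reference, the model I described above would look something like the sketch below, using the stated variables from the cleaned data (output omitted here).
#Sketch of the planned model: beauty regressed on the three work-history metrics.
beauty_model <- lm(beauty ~ avg_pos_len + avg_prev_tenure_len + tenure_len,
                   data = linkedin2)
summary(beauty_model)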